Introduction

This tutorial will walk you through the entire data science pipeline, starting with data collection and processing, then moving on to exploratory data analysis and visualization. Next, we will use hypothesis testing and machine learning to analyze the data. Lastly, we will summarize the insights learned along the way. This tutorial focuses mainly on data processing and analysis through visualizations created with the Pyplot and Plotly libraries.

Loading Data

The data set we will analyze is the Homicide Reports (1980-2014) from the FBI and FOIA, which can be downloaded here. We chose this data set because it contains many variables, allowing a variety of analyses from different angles. In addition, by analyzing the homicide reports and looking at the number of cases, we hope to find trends and become more aware of how serious the problem can be.

I imported the data from the CSV file and replaced any unknown values with np.nan, as seen in the code below.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("database.csv", low_memory=False)
df = pd.DataFrame(data)
# replace 'Unknown' values with np.nan
df.replace('Unknown', np.nan, inplace=True)
df.head()
Out[1]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... NaN Male 15 Native American/Alaska Native NaN Acquaintance Blunt Object 0 0 FBI
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... NaN Male 42 White NaN Acquaintance Strangulation 0 0 FBI
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... NaN NaN 0 NaN NaN NaN NaN 0 0 FBI
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... NaN Male 42 White NaN Acquaintance Strangulation 0 0 FBI
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... NaN NaN 0 NaN NaN NaN NaN 0 1 FBI

5 rows × 24 columns
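Since the 'Unknown' entries were just converted to np.nan, it can help to check how much of each column is now missing before analyzing. Below is a minimal sketch on a toy frame; the column names mirror the dataset, but the values are made up:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df after the 'Unknown' -> np.nan replacement
toy = pd.DataFrame({
    "Perpetrator Sex": ["Male", np.nan, "Male"],
    "Weapon": ["Blunt Object", np.nan, np.nan],
})
missing = toy.isna().sum()  # NaN count per column
print(missing.to_dict())    # {'Perpetrator Sex': 1, 'Weapon': 2}
```

Columns with a very high missing count (such as Perpetrator details for unsolved cases) deserve extra care in later groupbys.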

Plotting and Analyzing

The code below generates a graph of the number of homicides by year. In this plot, we are interested in the number of cases each year from 1980 to 2014. Instead of directly summing the Incident column, I used the size() method of groupby to count the number of rows in each group, which should be a little more accurate since two incidents can involve the same victim and perpetrator.
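To see why counting rows with size() can differ from summing the Incident column, here is a toy contrast (the rows are made up; in the real dataset, incident numbers can repeat within a group):

```python
import pandas as pd

# Two 1980 rows share incident number 1, so summing 'Incident' and
# counting rows can disagree
toy = pd.DataFrame({
    "Year": [1980, 1980, 1981],
    "Incident": [1, 1, 2],
})
rows_per_year = toy.groupby("Year").size()            # counts rows per group
incident_sum = toy.groupby("Year")["Incident"].sum()  # sums the column values
print(rows_per_year.to_dict())  # {1980: 2, 1981: 1}
print(incident_sum.to_dict())   # {1980: 2, 1981: 2}
```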

In [2]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
g = df.groupby('Year')
years = sorted(g.groups.keys())
size = g.size().values.ravel()
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='-', ms=5, color = 'purple', alpha = .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")

From the plot above we can immediately see a huge decline in the number of homicides in the late 1990s. Unfortunately, after some research, there still seems to be no definite answer for the cause of the decline. However, there are articles discussing some hypotheses; links are provided in the references at the end of this tutorial.

Fitting a Linear Model

To better understand the plot, we can use sklearn to fit a linear regression model to the data graphed above.

In [3]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
# fit the linear regression model
x = years
y = size
x = np.reshape(x,(-1,1))
y = y.reshape(-1,1)
regr.fit(x, y)
fig, ax = plt.subplots()
ax.plot(years, size, marker='.', linestyle='None', ms=5, color = 'purple', alpha = .5)
ax.plot(years, regr.predict(x).ravel(), color='blue', alpha= .5)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 1))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Homicide')
ax.set_title("1980-2014 Number of Homicide by Year")
plt.xticks(rotation=90)
plt.show()
plt.close("all")

From the regression line that sklearn generated, we can see that even though there are a few outliers around 1992-1994 and 1999-2000, the line has a negative slope. This means the overall number of homicides is decreasing as the years pass. This leads us to the next question: is there a relationship between the victim and the perpetrator?
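The sign of the trend can also be read directly from the fitted model's coefficients rather than eyeballing the plot. A sketch on synthetic declining yearly counts (the numbers here are made up, not the dataset's):

```python
import numpy as np
from sklearn import linear_model

# Synthetic declining yearly counts standing in for the homicide series
x = np.arange(1980, 2015).reshape(-1, 1)
y = (24000 - 300 * (x.ravel() - 1980)).reshape(-1, 1)

regr = linear_model.LinearRegression().fit(x, y)
slope = regr.coef_[0][0]
print(slope < 0)  # True: a negative slope confirms a downward trend
```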

For the next plot, we would like to see the relationship between the victim and the perpetrator. The "Other" category includes some of the closer relationships, which are listed below:

  Neighbor
  Boyfriend/Girlfriend
  Friend
  Family
  Common-Law Husband
  Common-Law Wife
  Stepdaughter
  Stepfather
  Stepmother
  Stepson
  Ex-Husband
  Ex-Wife
  Employee
  Employer

We will group the data by year and by the relationship between the victim and the perpetrator, separated into three categories: stranger, acquaintance, and other. We then count the total number of cases for each category, as seen in the code below.

In [4]:
g1 = df.groupby(['Year','Relationship'])
g1 = g1.size()

# Create a new dataframe with the year as the index and three columns indicating the number of 'Stranger', 'Acquaintance', and 'Other' cases
df2 = pd.DataFrame(index = years, columns = ['Stranger', 'Acquaintance', 'Other'])
stranger = []
acq= []
other = []
y = 1980
s = 0
# count the total number of each relationship category for each year
for index, series in g1.iteritems():
    
    if(index[1] == 'Stranger'):
        stranger.append(series)
    elif (index[1] == 'Acquaintance'):
        acq.append(series)
    else:
        s += series
    if(y == index[0]-1):
        y += 1
        other.append(s)
        s = 0
other.append(s)  

df2['Stranger'] = stranger
df2['Acquaintance'] =  acq
df2['Other'] = other
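As a side note, the same per-year tallies can be computed without the manual loop by grouping on both keys and then unstacking. A sketch on toy rows (the relationship labels come from the dataset; the rows themselves are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1980, 1980, 1980, 1981],
    "Relationship": ["Stranger", "Acquaintance", "Friend", "Stranger"],
})
# collapse everything besides Stranger/Acquaintance into 'Other'
rel = toy["Relationship"].where(
    toy["Relationship"].isin(["Stranger", "Acquaintance"]), "Other")
# one row per year, one column per category, zeros where a category is absent
counts = toy.groupby(["Year", rel]).size().unstack(fill_value=0)
print(counts.to_dict())
```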

Bar Graphs

For this plot I used the pandas DataFrame plot method, which wraps pyplot but is much simpler if you already have an organized dataframe. You can learn more here.

In [5]:
f, ax1 = plt.subplots(1, figsize=(20,6))
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Homicide')
ax1.set_title("Relationship Between Victim and Perpetrator")
df2.plot.bar(stacked=True,ax=ax1, alpha = .5, width = .8, color =['#F4561D','#F1911E','#F1BD1A'])
plt.show()

From the bar graph above, we can see the decreasing number of homicides in all three relationship categories we analyzed. However, while the numbers of cases start off fairly close for closer relationships and acquaintances, the decreasing trend for acquaintances is more obvious than for the others. We can also see that the number of homicides caused by strangers does not seem to decrease as much as the other two; it even becomes very close to the acquaintance numbers starting around 2000.

Now that we have seen the number of homicides by the relationship between victim and perpetrator, we might also want to know the genders of the victims and perpetrators, so we have included graphs of the number of homicide victims and perpetrators by gender.

In [6]:
g2 = df.groupby(['Year','Victim Sex'])
# reshape the pandas series into one column of female victims and one column of male victims
g2 = g2.size().values.reshape(35,2)
df3 = pd.DataFrame(index = years, columns =['#Female Victim','#Male Victim'], data=g2)

f, ax2 = plt.subplots(1, figsize=(20,6))
ax2.set_xlabel('Year')
ax2.set_ylabel('Number of Victim')
ax2.set_title("Sex of Homicide Victim")
df3.plot.bar(ax=ax2, color=['r','b'],alpha=0.5, width=0.8)

g3 = df.groupby(['Year','Perpetrator Sex'])
g3 = g3.size().values.reshape(35,2)
df4 = pd.DataFrame(index = years, columns =['#Female Perpetrator','#Male Perpetrator'], data=g3)

f, ax3 = plt.subplots(1, figsize=(20,6))
ax3.set_xlabel('Year')
ax3.set_ylabel('Number of Perpetrator')
ax3.set_title("Sex of Homicide Perpetrator")
df4.plot.bar(ax=ax3, color=['r','b'],alpha=0.5, width=0.8)
plt.show()

It might not be very surprising that the male numbers are much higher than the female numbers, but it is interesting that the trends for victims and perpetrators look almost identical. From the resulting graphs, we also notice that the number of perpetrators is lower than the number of victims, because perpetrator information is lacking for unsolved cases. So, for the next plot we will show the percentage of homicides solved each year.

To get the percentage of crimes solved each year, we grouped the data by year and by whether or not the case has been solved.

In [7]:
g4 = df.groupby(['Year','Crime Solved'])
# reshape the pandas series into one column of unsolved cases and one column of solved cases
g4 = g4.size().values.reshape(35,2)
df5 = pd.DataFrame(index = years, columns =['#Not Solved','#Solved'], data=g4)

#calculate the crime solve percentage
df5['Crime Solved %'] = (df5['#Solved']/(df5['#Solved']+df5['#Not Solved'])*100)
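As a quick sanity check, the percentage formula behaves as expected on known, made-up counts:

```python
import pandas as pd

# 15 solved out of 20 total cases should give 75%
toy = pd.DataFrame({"#Not Solved": [5], "#Solved": [15]}, index=[1980])
toy["Crime Solved %"] = toy["#Solved"] / (toy["#Solved"] + toy["#Not Solved"]) * 100
print(toy["Crime Solved %"].iloc[0])  # 75.0
```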

After that, we plot the data with a linear regression line to generate the graph below.

In [8]:
x = years
y = df5['Crime Solved %']
x = np.reshape(x, (-1,1))
y = y.values.reshape(-1,1)

regr = linear_model.LinearRegression()
# fit the regression model
regr.fit(x, y)

fig, ax5 = plt.subplots()
ax5.plot(df5.index, df5['Crime Solved %'], marker='.', linestyle='None', ms=5, color = 'orange')
# recompute the axis limits here rather than reusing start/end from the earlier cell
start, end = ax5.get_xlim()
ax5.xaxis.set_ticks(np.arange(start, end, 1))
ax5.set_xlabel('Year')
ax5.set_ylabel('Percentage of Homicide Solved')
ax5.set_title("1980-2014 Percentage of Homicide Solved by Year")
# plot the regression line
ax5.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)   
plt.xticks(rotation=90)
plt.show()

The result was a little unexpected for me. At first, I thought we would see a clear linear increase in the percentage of solved homicide cases, because of improving technology and because we can learn from previous experience. However, the regression line shows that the percentage of solved cases is actually decreasing over time. One possible reason is that perpetrators also have access to knowledge and technology that make cases harder to solve.

Analyze by States

Choropleth Map

The choropleth map is another powerful technique that provides strong visualization. Below we will use a choropleth map to show the total number of homicide cases from 1980 to 2014.

In [9]:
g5 = df.groupby('State')
g5 = g5.size()

# since we cannot show DC on the 50-state map, we add DC's numbers to Maryland, since DC is surrounded by Maryland
maryland_total = g5.get('District of Columbia') + g5.get('Maryland')
g5.set_value('Maryland',maryland_total)

# translate the state names into state codes so the plotly map can process the data
code = ['AL','AK', 'AZ', 'AR','CA', 'CO','CT','DE','DC','FL',
    'GA', 'HI', 'ID','IL','IN','IA', 'KS','KY', 'LA', 'ME', 'MD','MA','MI',
     'MN', 'MS', 'MO','MT', 'NE', 'NV', 'NH','NJ','NM','NY', 'NC', 'ND', 'OH',
     'OK', 'OR','PA','RI','SC', 'SD','TN', 'TX','UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
g5.keys= code
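Assigning the code list positionally relies on the groups coming out in alphabetical state order. A sketch of a more explicit alternative, mapping names to codes with a dict (only three states shown here; the Maryland total is illustrative):

```python
import pandas as pd

# explicit name -> code mapping; no dependence on ordering
state_codes = {"Alabama": "AL", "Alaska": "AK", "Maryland": "MD"}
totals = pd.Series({"Alabama": 11376, "Alaska": 1617, "Maryland": 10000})
totals.index = totals.index.map(state_codes)
print(totals.index.tolist())  # ['AL', 'AK', 'MD']
```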

After taking the data, and grouping them by states, we were able to generate the map below using plotly:

In [10]:
import plotly
# initialize plotly offline mode so you do not need a plotly account
plotly.offline.init_notebook_mode()

#we created the purple color scale to use
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = g5.keys,
        z = g5.values,
        locationmode = 'USA-states',
    
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Number of cases")
        ) ]

layout = dict(
        title = '1980 - 2014 Number of Homicide by State',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )

From the result, we can hypothesize that states with higher populations also have higher numbers of cases. To test this hypothesis, we also gathered state population data from the Census. The state_population data was gathered by manually looking up each state's population for each year from 1980 to 2014 and putting it into an Excel document.

In [11]:
popdf = pd.read_excel("state_population.xlsx")

# add an extra column with the number of cases, which can be used later
popdf['Cases']= g5.values 
# as with the previous data, we add DC's numbers to MD and take the average population across the years
m = popdf.loc[popdf['State']=='MD'] 
d = popdf.loc[popdf['State']=='DC'] 
t = m.values+d.values
t = np.delete(t,0)
t = np.delete(t,35)
a = np.mean(t)
popdf.set_value(20,'Average', a)
popdf.head()
Out[11]:
State 1980 1981 1982 1983 1984 1985 1986 1987 1988 ... 2007 2008 2009 2010 2011 2012 2013 2014 Average Cases
0 AL 3893888 3918531 3925266 3934102 3951820 3972523 3991569 4015264 4023844 ... 4672840 4718206 4757938 4785298 4799918 4815960 4829479 4843214 5.217157e+06 11376
1 AK 401851 418491 449606 488417 513702 532495 544268 539309 541983 ... 680300 687455 698895 713985 722713 731089 736879 736705 6.055167e+05 1617
2 AZ 2718215 2810107 2889861 2968925 3067135 3183538 3308262 3437103 3535183 ... 6167681 6280362 6343154 6413737 6467163 6549634 6624617 6719993 4.689746e+06 12871
3 AR 2286435 2293201 2294257 2305761 2319768 2327046 2331984 2342355 2342656 ... 2848650 2874554 2896843 2921606 2939493 2950685 2958663 2966912 2.579688e+06 6947
4 CA 23667902 24285933 24820009 25360026 25844393 26441109 27102237 27777158 28464249 ... 36250311 36604337 36961229 37349363 37676861 38011074 38335203 38680810 3.210574e+07 99783

5 rows × 38 columns

In [12]:
plotly.offline.init_notebook_mode()
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = popdf['State'],
        z = popdf['Average'],
        locationmode = 'USA-states',
    
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Population")
        ) ]

layout = dict(
        title = '1980 - 2014 Average Population by State',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
plotly.offline.iplot( fig )

The average population data agrees with our hypothesis; the map looks almost identical to the previous choropleth map.

To get a better view of the relationship between the number of homicides by state and the average population by state, we can again fit a linear model and draw the regression line between them.

In [13]:
fig, ax6 = plt.subplots()
ax6.plot(popdf.Cases,popdf.Average, linestyle='None', marker='.' )
x = popdf.Cases
y = popdf.Average
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)

regr = linear_model.LinearRegression()
regr.fit(x, y)

ax6.plot(x.ravel(), regr.predict(x).ravel(), color='blue', alpha= .5)  
# remove the auto offset and scientific notation for large numbers
ax6.ticklabel_format(useOffset=False, style='plain')
ax6.set_xlabel('Number of Homicide')
ax6.set_ylabel('Average State Population')
ax6.set_title("1980-2014 Total Number of Homicide vs. States Average Population")
plt.show()

The result clearly shows a positive relationship between the number of homicides and the population: as the population increases, the number of homicide cases also increases.
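The strength of that positive relationship can also be quantified with a correlation coefficient. A sketch on made-up state numbers (populations in millions, case counts roughly scaling with them):

```python
import numpy as np

# Made-up per-state populations and case counts that scale with them
pop = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
cases = np.array([10, 22, 39, 85, 160])
r = np.corrcoef(pop, cases)[0, 1]  # Pearson correlation coefficient
print(r > 0.9)  # True: a value near 1 means a strong positive relation
```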

Motion Bubble Chart

The motion bubble chart is another visualization tool that can help us see the relationship between different variables. Our motion bubble chart will show how each state's population and number of homicides changed each year.

Handle Missing Data

To prepare our data for the motion bubble chart, we need to rearrange it to get the number of cases each year for each state. However, we found that data for some states is missing for some years. We looked back at the original data source, but it does not mention the missing data, so we were not able to find out why it is missing.

In [14]:
import queue 

dfState = df.groupby(['State','Year']).size()

# find and fill missing data, using a queue to check the range of years for each state
years = queue.Queue()
#range is inclusive for the start values and exclusive for the end value
for j in range(1980,2015):
    years.put(j)
# iterate over the rows and, for each state, find the missing years and insert np.nan for them
for i, row in dfState.iteritems(): 
    if(years.empty()):
        for j in range(1980,2015):
            years.put(j)
    y = years.get()
    if(type(i) != int):
        if(i[1] != y):
            for x in range(y, i[1]):
               
                dfState.loc[(i[0],x)] = np.nan
                y = years.get()
            

# convert the pandas series to a dataframe
dataState = dfState.to_frame('Crime')

#making year and state columns
dataState = dataState.reset_index()
#sort dataframe first by year then by state
dataState.sort_values(by=['Year', 'State'], inplace=True)

#now we want to add the population data to our dataframe
#drop the unused columns
temp_pop = popdf.drop('Average',1)
temp_pop.drop('Cases', 1,inplace=True)
temp_pop.drop('State', 1,inplace=True)
temp_pop = temp_pop.transpose()
pop = temp_pop.as_matrix()

dataState['Population'] = pop.reshape(1785,1)
dataState['pop'] = dataState['Population']
#rearrange to use it in the bubble chart
dataState = dataState[['Year','pop','Crime','Population','State']]

To handle the missing data, we decided to treat the missing entries as zero, so the number of crime cases will be 0 wherever data is missing.
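For reference, pandas can fill the missing (state, year) combinations in one step by reindexing on the full grid. A minimal sketch on toy data (the state and counts are made up):

```python
import pandas as pd

# Alaska is missing 1981 in this toy series
toy = pd.Series(
    [3, 5],
    index=pd.MultiIndex.from_tuples(
        [("Alaska", 1980), ("Alaska", 1982)], names=["State", "Year"]))
# build the full (state, year) grid and fill absent combinations with 0
full = pd.MultiIndex.from_product(
    [["Alaska"], range(1980, 1983)], names=["State", "Year"])
filled = toy.reindex(full).fillna(0)
print(filled.loc[("Alaska", 1981)])  # 0.0
```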

Finally, we can import motionchart and pass our dataframe to the motion chart. The bubble size is determined by the state's population, the y-axis shows the population, the x-axis shows the number of crimes, and the chart steps through the years.

You can click on a bubble to see its state name, and hover the mouse over a bubble to see more information.

In [15]:
from motionchart.motionchart import MotionChart, MotionChartDemo
mChart = MotionChart(df=dataState, title = "Crime Cases by States")
mChart.to_notebook()

The motion bubble chart lets us see some of the outliers, which include California, Texas, New York, Florida, and Illinois. It is very interesting that Texas's population caught up with and eventually became larger than New York's. However, Texas had more crime cases than New York even while its population was still smaller. Furthermore, as Florida also caught up with New York's population, it too had more crime cases than New York. It seems that fast-growing states could have more crime cases than other states.

Summary and References

This tutorial only highlighted some of the basic elements of data analysis in Python. There are many more ways of handling and analyzing data, especially when it comes to handling missing data and machine learning. More details and tools are available from the following links.

Data:

  1. Kaggle - Homicide Reports, 1980-2014: https://www.kaggle.com/murderaccountability/homicide-reports
  2. Census: https://census.gov/topics/population/data.html

Visualization Tool:

  1. Pyplot: https://matplotlib.org/users/pyplot_tutorial.html
  2. Pandas Matplotlib: http://pandas.pydata.org/pandas-docs/stable/visualization.html
  3. Plotly: https://plot.ly/python/
  4. Motion Bubble Chart: https://github.com/hmelberg/motionchart

Others :

  1. Murder Accountability Project: http://www.murderdata.org/
  2. https://www.theatlantic.com/politics/archive/2016/04/what-caused-the-crime-decline/477408/
  3. http://pricetheory.uchicago.edu/levitt/Papers/LevittUnderstandingWhyCrime2004.pdf
  4. Handling Missing Data : http://pandas.pydata.org/pandas-docs/stable/missing_data.html
  5. Sklearn : http://scikit-learn.org/stable/